Abstract

A generalized ensemble model (gEnM) for document ranking is proposed in this paper. The gEnM linearly combines basis 
document retrieval models and tries to retrieve relevant documents at high positions. In order to obtain the optimal linear 
combination of multiple document retrieval models or rankers, an optimization program is formulated by directly maximizing 
the mean average precision. Both supervised and unsupervised learning algorithms are presented to solve this program. For the 
supervised scheme, two approaches are considered based on the data setting, namely batch and online setting. In the batch setting, 
we propose a revised Newton’s algorithm, gEnM.BAT, by approximating the derivative and Hessian matrix. In the online setting, 
we advocate a stochastic gradient descent (SGD) based algorithm—gEnM.ON. As for the unsupervised scheme, an unsupervised 
ensemble model (UnsEnM) by iteratively co-learning from each constituent ranker is presented. Experimental study on benchmark 
data sets verifies the effectiveness of the proposed algorithms. Therefore, with appropriate algorithms, the gEnM is a viable option 
in diverse practical information retrieval applications. 

Index Terms: ensemble model, mean average precision, document ranking, information retrieval, nonlinear optimization


I. Introduction 

Ranking is a core task for Information Retrieval (IR) in
practical applications such as search engines and advertising
recommendation systems. The aim of the ranking task is to
retrieve the most relevant objects (documents, for example)
with regard to a given query. With the continuous growth
of information on the modern World Wide Web, this task has
become more challenging than ever before. In the ranking
task, the general problem is the over-inclusion of relevant
documents that a user is willing to receive [?]. During the last
decade, a large number of models have been proposed to solve
this problem. In general, those models are evaluated by two
IR performance measures, namely Mean Average Precision
(MAP) and Normalized Discounted Cumulative Gain (NDCG)
[?]. Compared to the framework in which models are proposed
and then tested by IR measures, approaches that directly
optimize IR measures have been shown to be more effective [?],
[?]. These approaches apply efficient algorithms to solve the
optimization problem whose objective function is one of
the IR measures.

Structured SVM is a widely used framework for optimizing
a bound on IR measures. Examples include $\mathrm{SVM}^{map}$ [?]
and $\mathrm{SVM}^{ndcg}$ [?]. Many other methods, such as SoftRank [7],
[?], first approximate the ranking measures through smooth
functions and then optimize the surrogate objective functions.
Yet, the drawbacks of those methods have been shown in two
aspects: a) the relationship between the surrogate objective
functions and ranking measures was not sufficiently studied;
and b) the algorithms solving the optimization problems are
not trivial to employ in practice [?]. Recently, a general
framework that directly optimizes IR measures has been
reported [?]. This framework can effectively overcome those
drawbacks. However, it only optimizes the IR measure of one
ranker, and the information provided by other rankers is not
fully utilized.

In the classification area, an ensemble classifier that linearly
combines multiple classifiers has been shown
to perform better than any of the constituent classifiers. A
number of sophisticated algorithms have been proposed for
obtaining the ensemble classifier, such as AdaBoost [9]. Thus,
the hypothesis that the performance can be improved by
combining multiple rankers may be true as well. As a matter
of fact, AdaRank [10], [11] and LambdaMART are two well-known
models in the IR area utilizing AdaBoost. AdaRank
repeatedly constructs weak rankers (features) and finally linearly
combines them into a strong ranker with proper weights assigned
to the constituent rankers. However, the drawback of
AdaRank is that its theoretical justification and the determination
of the iteration number are not explicit. While LambdaMART
enjoys the theoretical advantage of directly optimizing IR
measures by linearly combining any two rankers, it cannot
be extended to multiple rankers straightforwardly. In those
previous studies, the direct optimization of NDCG is well
studied but the direct optimization of MAP is rarely tackled,
to the best of our knowledge. The main difficulty of directly
optimizing MAP is that the objective function defined by MAP
is nonsmooth, nondifferentiable and nonconvex. The Ensemble
Model (EnM) [?] solves this problem by using a boosting
algorithm and a coordinate descent algorithm. However, the
solutions cannot be theoretically guaranteed to be optimal, or
even locally optimal.

In this paper, we propose a generalized ensemble model
(gEnM) for document ranking. It is an ensemble ranker that
linearly combines multiple rankers. By appropriate adjustments
to the weights for those constituent rankers, one may
improve the overall performance of document ranking. To
compute the weights, we formulate a constrained nonlinear
program which directly optimizes the MAP. The difficulty of
solving this nonlinear program lies in the nondifferentiable and
noncontinuous objective function. To overcome this difficulty,
we first introduce a differentiable surrogate to approximate
the objective function, and then formulate an approximated
unconstrained nonlinear program.

Both supervised and unsupervised algorithms are employed 
for solving the nonlinear program. In the supervised scheme, 
batch and online data settings are considered. These schemes 
and settings are designed for different IR environments. For the 
batch setting, the algorithm gEnM.BAT is a revised Newton’s 
method by approximating the derivative and Hessian matrix. 
As for the online setting, an online algorithm, gEnM.ON,
is proposed based on stochastic gradient descent algorithms.
The gEnM.ON is the first online algorithm for obtaining
an ensemble ranker, to the best of our knowledge. In the
unsupervised scheme, an unsupervised gEnM (UnsEnM)
inspired by iRANK [13] is proposed. The UnsEnM utilizes
the collaborative information among constituent rankers. The
advantage of UnsEnM over iRANK is that it is applicable
to any number of constituent rankers. Compared to the EnM,
the generalized version gEnM differs in three aspects:

1) the assumption for EnM is relaxed for gEnM;

2) the batch algorithms proposed for gEnM perform better;

3) both an online algorithm and an unsupervised algorithm are
proposed for gEnM, whereas only a batch algorithm is proposed for
EnM.

The remainder of this paper is organized as follows. In 
the next section, the problem of direct optimization of MAP 
is described and formulated. Also, the approximation to this
problem is provided along with the theoretical proofs. The
algorithms, including gEnM.BAT, gEnM.ON and UnsEnM, 
are presented in Section 5. The computational results of 
the proposed algorithms tested on the public data sets are 
demonstrated in Section 6. The last section concludes this 
paper with discussions. 

II. Generalized Ensemble Model 

A. Problem Description 

Consider the task of constructing a linear combination of
rankers that results in better performance than each constituent.
We call this linear combination the ensemble ranker or ensemble
model hereinafter. Given a search query in this task, a
sequence of documents is retrieved by the constituent rankers
according to the relevance to the query. The relevance is
measured by the ranking scores calculated by each ranker.
For explicit description, let $\mathit{score}_k$ denote the ranking score
(or relevance score) calculated by the $k$-th ranker. With appropriate
weights $\mathit{weight}_k$ over those constituent rankers, the ranking
score $\mathit{score}$ of the ensemble ranker is defined as the weighted
sum of the constituent ranking scores, i.e.,

\[
\mathit{score} = \mathit{weight}_1 \cdot \mathit{score}_1 + \mathit{weight}_2 \cdot \mathit{score}_2 + \cdots + \mathit{weight}_k \cdot \mathit{score}_k,
\]

where the weights satisfy $\mathit{weight}_i \ge 0$ and
$\mathit{weight}_1 + \mathit{weight}_2 + \cdots + \mathit{weight}_k = 1$. The documents ranked by the
ensemble ranker are thus ordered according to the ensemble
ranker scores. Our goal is to uncover an optimal weight vector

\[
\mathit{weight} = (\mathit{weight}_1, \mathit{weight}_2, \ldots, \mathit{weight}_k)^{T}
\]

with which more relevant documents can be ranked at high
positions.

A toy example shown in Table I describes this problem.
According to the ranking scores, the ranking lists returned
by Ranker 1 and 2 are {2,1,3} and {3,1,2}, respectively, and
the corresponding MAPs are 0.72 and 0.72. In order to make
full use of the ranking information provided by both rankers,
a conventional heuristic is to sum up the ranking scores (i.e.,
use uniform weights, (0.5,0.5)), which generates Ensemble
1 with MAP equal to 0.72. Obviously, this procedure is not
optimal since alternative weights can
generate a better precision. For example, Ensemble 2 uses
weights (0.7, 0.3) and attains a higher MAP, i.e., 0.89,
as listed in the table.

TABLE I: A toy example. The values in the middle three rows
represent the ranking scores given an identical query. The
rankers are measured by MAP, as listed in the fifth row. The
ranking scores of Ensemble 1 and 2 are defined by 0.5*Ranker 1
+ 0.5*Ranker 2 and 0.7*Ranker 1 + 0.3*Ranker 2, respectively.
The relevant document list is assumed to be {2,3}.

              Ranker 1   Ranker 2   Ensemble 1   Ensemble 2
Document 1    0.35       0.2        0.55         0.305
Document 2    0.4        0.1        0.5          0.31
Document 3    0.25       0.7        0.95         0.385
MAP           0.72       0.72       0.72         0.89
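The ranking orders behind Table I can be reproduced with a few lines of code. The following is a minimal sketch, assuming NumPy is available; the helper name `ensemble_ranking` is hypothetical and only the ranking scores are taken from the table. It shows that uniform weights keep Document 1 between the two relevant documents, while the weights (0.7, 0.3) move both relevant documents {2,3} to the top.

```python
import numpy as np

# Ranking scores from Table I: rows are Documents 1-3,
# columns are Ranker 1 and Ranker 2.
scores = np.array([[0.35, 0.2],
                   [0.40, 0.1],
                   [0.25, 0.7]])


def ensemble_ranking(scores, weights):
    """Return 1-based document ids sorted by descending ensemble score."""
    combined = scores @ np.asarray(weights, dtype=float)
    order = np.argsort(-combined, kind="stable")
    return [int(i) + 1 for i in order]


uniform_ranking = ensemble_ranking(scores, [0.5, 0.5])  # Ensemble 1 ordering
tuned_ranking = ensemble_ranking(scores, [0.7, 0.3])    # Ensemble 2 ordering
print(uniform_ranking, tuned_ranking)
```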


This toy example implies that there exist optimal weights
assigned to the constituent rankers to construct an ensemble
ranker. Different from proposing new probabilistic or nonprobabilistic
models, this ensemble model motivates an alternative
way of solving ranking tasks. In order to formulate this task
as an optimization problem, the MAP metric is used as
the objective function since it reflects the performance of
an IR system and tends to discriminate stably among systems
compared to other IR metrics [?]. Therefore, our goal
becomes to calculate the weights with which the MAP is
maximized. In the following, we will describe and solve this
problem mathematically.

B. Problem Definition 

Let $D$ be a set of documents, $Q$ a set of queries and $\Phi$ a set
of rankers. $D_i$ denotes the relevant document list for query $q_i$, $d_j \in D$
the document associated with the $j$-th relevant document in
$D_i$, $q_i \in Q$ the $i$-th query and $\phi_k \in \Phi$ the $k$-th ranker. $L$
represents the number of queries, $|D_i|$ the number of relevant
documents associated with $q_i$ and $K_\phi$ the number of rankers.
The ensemble ranker is defined as $H = \sum_{k=1}^{K_\phi} \alpha_k \phi_k$, which
linearly combines constituent rankers with weights $\alpha_k$. We
assume the relevant documents have been sorted in descending
order according to the ranking scores. On the basis of these
notations and the definition of MAP, the aforementioned
problem can be formulated as:

\[
\begin{aligned}
\max \quad & \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_j, H)} \\
\text{s.t.} \quad & \sum_{k=1}^{K_\phi} \alpha_k = 1, \\
& 0 \le \alpha_k \le 1, \quad k = 1, 2, \ldots, K_\phi,
\end{aligned}
\tag{P1}
\]


where $R(d_j, H)$ represents the ranking position of document
$d_j$ given by the ensemble model $H$. In this constrained
nonlinear program, a) the objective function is a general
definition of MAP; and b) the constraints indicate that the linear
combination is convex and that the weights can be interpreted
as a distribution. Since the position function $R(d_j, H)$ is
defined by the ranking scores, it can be written as

\[
R(d_j, H) = 1 + \sum_{d \in D,\, d \neq d_j} I\{ s_{d_j,d}(H) < 0 \},
\tag{1}
\]

where $s_{x,y}(H) = s_x(H) - s_y(H)$ and $I\{ s_{x,y}(H) < 0 \}$ is an
indicator function which equals 1 if $s_{x,y}(H) < 0$ is true and 0
otherwise. Here, $s_x(H)$ denotes the ranking score of document
$x$ given by ensemble model $H$ and $s_{x,y}(H)$ the difference of
the ranking scores between documents $x$ and $y$. Since $s_x(H)$
is linear with respect to the weights, it can be rewritten as

\[
s_x(H) = s_x\!\left( \sum_{k=1}^{K_\phi} \alpha_k \phi_k(q_i) \right) = \sum_{k=1}^{K_\phi} \alpha_k\, s_x(\phi_k(q_i)),
\tag{2}
\]

where $s_x(\phi_k(q_i))$ denotes the relevance score of document $x$
for query $q_i$ calculated by model $\phi_k$.
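To make Equation (1) and the objective of Problem P1 concrete, the sketch below evaluates the position function and the average precision of a single query. It is a minimal illustration, assuming NumPy; the function names, the score matrix and the weights are hypothetical and do not come from the paper's experiments.

```python
import numpy as np


def positions(ensemble_scores):
    """Equation (1): 1 + number of other documents with a strictly higher score."""
    s = np.asarray(ensemble_scores, dtype=float)
    diff = s[:, None] - s[None, :]      # diff[j, d] = s_{d_j, d}(H)
    return 1 + (diff < 0).sum(axis=1)   # the diagonal is 0, so d = d_j never counts


def average_precision(ensemble_scores, relevant):
    """Inner sum of (P1) for one query: (1/|D_i|) * sum_j j / R(d_j, H)."""
    # Sorting the positions realizes the assumption that relevant documents
    # are ordered by descending ensemble score.
    pos = np.sort(positions(ensemble_scores)[list(relevant)])
    j = np.arange(1, len(pos) + 1)
    return float(np.mean(j / pos))


# Hypothetical scores: 4 documents (rows) scored by 2 rankers (columns).
S = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5],
              [0.1, 0.3]])
alpha = np.array([0.6, 0.4])    # convex weights, as required by (P1)
ensemble = S @ alpha            # Equation (2): s_x(H) is linear in the weights
ap = average_precision(ensemble, relevant=[1, 2])
print(positions(ensemble), ap)
```

Averaging `average_precision` over all queries gives the MAP objective of Problem P1; because `positions` only counts comparisons, the result is piecewise constant in `alpha`.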

Here, we give an example plot that illustrates the graph of
the objective function. This example employed the MED data
set with the settings identical to those in [?] except that only
two constituent rankers, LDI and pLSI, were used to comprise
the ensemble ranker for plotting purposes. The weights were
restricted to the constraints in Problem P1 with a precision
of three digits after the decimal point. In detail, the objective
function was evaluated by setting $\alpha_1$ for LDI and $\alpha_2$ for pLSI,
where $\alpha_1 + \alpha_2 = 1$, and $\alpha_1$ increased from 0 to 1 with a
step size of 0.001. Figure 1 shows part of the graph of the
objective function. From this plot, it is clearly observed that
a) the objective function is highly nonsmooth and nonconvex;
and b) there are numerous local optima in the objective
function. Though the nondifferentiability is not obvious in this
graph, the position function implies that the objective function
is nondifferentiable in terms of the weights. Therefore, general
gradient-based algorithms, such as Lagrangian Relaxation and
Newton’s Method, cannot be applied to this problem directly
to find the optimum, or even local optima [13].

From this analysis of the objective function, the position 
function plays an important role in the differentiability. Thus, 
we will discuss how to approximate it with a differentiable 
function and how to solve this optimization Problem PI in the 
next two sections. 



Fig. 1: An illustrated example of the objective function with
two constituent rankers in Problem P1.


III. Approximation


In this section, we propose a differentiable surrogate for the 
position function and further approximate the Problem PI with 
an easier nonlinear program. 

Since the position function is defined by an indicator
function (Equation (1)), we can use a sigmoid function to
approximate this indicator function, i.e.,

\[
I\{ s_{d_j,d}(H) < 0 \} \approx \frac{\exp(-\beta s_{d_j,d}(H))}{1 + \exp(-\beta s_{d_j,d}(H))},
\tag{3}
\]

where $\beta > 0$ is a scaling constant. It is obvious that this
approximation is in the range of $[0.5, 1)$ if $s_{d_j,d}(H) \le 0$ and
$(0, 0.5]$ if $s_{d_j,d}(H) \ge 0$. The following theorem shows that
we can get a tight bound by this approximation.


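Theorem 1. The difference between the sigmoid function $g_{ij}$
and the indicator function $I\{ s_{d_j,d} < 0 \}$ is bounded as:

\[
\left| g_{ij} - I\{ s_{d_j,d} < 0 \} \right| \le \frac{1}{1 + \exp(\beta \delta_{ij})},
\]

where $\delta_{ij} = \min |s_{d_j,d}|$ and

\[
g_{ij} = \frac{\exp\left(-\beta \sum_{k=1}^{K_\phi} \alpha_k s_{d_j,d}\right)}{1 + \exp\left(-\beta \sum_{k=1}^{K_\phi} \alpha_k s_{d_j,d}\right)}.
\]

$s_{d_j,d}$ represents $s_{d_j,d}(\phi_k(q_i))$ for notational simplicity
henceforth.

Proof. For $s_{d_j,d} > 0$, we have $I\{ s_{d_j,d} < 0 \} = 0$ and
$\delta_{ij} \le s_{d_j,d}$, thus,

\[
\left| g_{ij} - I\{ s_{d_j,d} < 0 \} \right| \le \frac{1}{1 + \exp\left(\beta \delta_{ij} \sum_{k=1}^{K_\phi} \alpha_k\right)}.
\]

For $s_{d_j,d} < 0$, we have $I\{ s_{d_j,d} < 0 \} = 1$ and $\delta_{ij} \le -s_{d_j,d}$,
thus,

\[
\left| g_{ij} - I\{ s_{d_j,d} < 0 \} \right| \le \frac{1}{1 + \exp\left(\beta \delta_{ij} \sum_{k=1}^{K_\phi} \alpha_k\right)}.
\]

Since $\sum_{k=1}^{K_\phi} \alpha_k = 1$, we can get

\[
\left| g_{ij} - I\{ s_{d_j,d} < 0 \} \right| \le \frac{1}{1 + \exp(\beta \delta_{ij})}.
\tag{4}
\]

This completes the proof. □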


This theorem tells us that the sigmoid function is asymptotic
to the indicator function, especially when $\beta$ is chosen to
be large enough. By using this approximation, the position
function can be correspondingly approximated as

\[
\hat{R}(d_j, H) = 1 + \sum_{d \in D,\, d \neq d_j} \frac{\exp(-\beta s_{d_j,d}(H))}{1 + \exp(-\beta s_{d_j,d}(H))},
\tag{5}
\]

which becomes differentiable and continuous.
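The effect of the scaling constant on the smoothed position function can be checked numerically. The sketch below is an illustration under assumptions: it uses NumPy, the function name `smooth_positions` is hypothetical, and the four ensemble scores are made-up values. It compares the surrogate of Equation (5) with the exact positions of Equation (1) for a large and a small scaling constant.

```python
import numpy as np


def smooth_positions(ensemble_scores, beta):
    """Equation (5): sigmoid surrogate of the position function."""
    s = np.asarray(ensemble_scores, dtype=float)
    diff = s[:, None] - s[None, :]          # diff[j, d] = s_{d_j, d}(H)
    # 1 / (1 + exp(beta * x)) equals exp(-beta * x) / (1 + exp(-beta * x)).
    g = 1.0 / (1.0 + np.exp(beta * diff))
    np.fill_diagonal(g, 0.0)                # exclude the term d = d_j
    return 1.0 + g.sum(axis=1)


# Hypothetical ensemble scores of four documents.
ensemble = np.array([0.58, 0.44, 0.50, 0.18])
exact = np.array([1, 3, 2, 4])              # positions given by Equation (1)

sharp = smooth_positions(ensemble, beta=300.0)
loose = smooth_positions(ensemble, beta=1.0)
print(np.abs(sharp - exact).max(), np.abs(loose - exact).max())
```

With the large constant the surrogate is numerically indistinguishable from the exact positions, while the small constant blurs all positions towards the middle of the list.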

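Then it is trivial to show the approximation error of the position
function, i.e.,

\[
\left| \hat{R}(d_j, H) - R(d_j, H) \right| \le \sum_{d \in D,\, d \neq d_j} \left| g_{ij} - I\{ s_{d_j,d} < 0 \} \right| \le \frac{|D| - 1}{1 + \exp(\beta \delta_{ij})}.
\tag{6}
\]

Suppose 1000 documents exist in the document set $D$ and
$\delta_{ij} = 0.04$. By setting $\beta = 300$, the approximation error of
the position function is bounded by

\[
\left| \hat{R}(d_j, H) - R(d_j, H) \right| \le \frac{999}{1 + e^{12}} \approx 0.006,
\tag{7}
\]

which is tight enough for our problem.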
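The numerical bound quoted above follows directly from the right-hand side of Equation (6). A minimal sketch using only the Python standard library is given below; the function name `position_error_bound` is hypothetical.

```python
import math


def position_error_bound(num_docs, beta, delta):
    """Right-hand side of Equation (6): (|D| - 1) / (1 + exp(beta * delta))."""
    return (num_docs - 1) / (1.0 + math.exp(beta * delta))


# Setting used in the text: |D| = 1000, beta = 300, delta_ij = 0.04.
bound = position_error_bound(num_docs=1000, beta=300, delta=0.04)
print(f"{bound:.4f}")
```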

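In this way, the original Problem P1 can be approximated
by the following problem

\[
\begin{aligned}
\max \quad & \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{\hat{R}(d_j, H)} \\
\text{s.t.} \quad & \sum_{k=1}^{K_\phi} \alpha_k = 1, \\
& 0 \le \alpha_k \le 1, \quad k = 1, 2, \ldots, K_\phi.
\end{aligned}
\tag{P2}
\]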


Using the settings identical to Figure 1, Figure 2 plots the
graphs of the original objective function (OOF) in Problem P1
and the approximated objective function (AOF) in Problem
P2. As shown in the plot, the trend of the AOF is close to
that of the OOF. The weights generating the optimal MAP
remain almost unchanged in these two curves. This example
illustrates that the original noncontinuous
and nondifferentiable objective function can be effectively
approximated by a continuous and differentiable function. The
following lemma and theorem prove this
conclusion theoretically.


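Theorem 2. The error between the OOF in Problem P1 and
the AOF in Problem P2 is bounded as

\[
\left| \hat{\Lambda} - \Lambda \right| \le \frac{(|D| - 1)\left(L + \sum_{i=1}^{L} |D_i|\right)}{2L\left(1 + \exp(\beta \delta_{ij})\right)},
\tag{8}
\]

where $\hat{\Lambda}$ and $\Lambda$ denote the objective function in Problem P2
and Problem P1, respectively.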



Fig. 2: Comparison of the OOF in Problem P1 and AOF in
Problem P2 ($\beta = 200$).


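Proof. For the approximation error, we have

\[
\left| \hat{\Lambda} - \Lambda \right| = \frac{1}{L} \left| \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j (R - \hat{R})}{\hat{R} R} \right|,
\]

where $R$ denotes $R(d_j, H)$ for notational simplicity. Since
$\hat{R} = 1 + \sum_{d \neq d_j} g_{ij}$ and $R = 1 + \sum_{d \neq d_j} I\{ s_{d_j,d} < 0 \}$
are strictly positive and not smaller than 1, we have

\[
\left| \frac{j (R - \hat{R})}{\hat{R} R} \right| = \frac{j \left| \hat{R} - R \right|}{\hat{R} R} \le j \left| \hat{R} - R \right|.
\]

According to Equation (6), we have

\[
\left| \hat{\Lambda} - \Lambda \right| \le \frac{(|D| - 1)\left(L + \sum_{i=1}^{L} |D_i|\right)}{2L\left(1 + \exp(\beta \delta_{ij})\right)}.
\tag{9}
\]

This completes the proof. □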


This theorem indicates that the OOF in Problem P1 can
be accurately approximated by the surrogate defined by the
position function (5) in Problem P2. For example, if $|D| = 10000$,
$L = 200$, $\sum_{i} |D_i| = 500$, $\beta = 300$ and $\delta_{ij} = 0.04$,
the absolute discrepancy between the objectives in Problem
P1 and P2 is bounded by

\[
\left| \hat{\Lambda} - \Lambda \right| \le \frac{9999 \times 700}{400\,(1 + e^{12})} \approx 0.11.
\]

This discrepancy is within an acceptable level and will
decrease with the growth of the query size $L$ and the value of
$\beta$.
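The bound of Theorem 2 can be evaluated for any setting by plugging values into the right-hand side of Equation (8). The sketch below uses only the standard library; the function name `objective_error_bound` is hypothetical, and the second call with a larger scaling constant is an added illustration of how quickly the bound shrinks.

```python
import math


def objective_error_bound(num_docs, num_queries, total_relevant, beta, delta):
    """Right-hand side of (8): (|D|-1)(L + sum_i |D_i|) / (2L(1 + exp(beta*delta)))."""
    numerator = (num_docs - 1) * (num_queries + total_relevant)
    return numerator / (2 * num_queries * (1.0 + math.exp(beta * delta)))


# Setting used in the text: |D| = 10000, L = 200, sum_i |D_i| = 500,
# beta = 300, delta_ij = 0.04.
bound_300 = objective_error_bound(10000, 200, 500, beta=300, delta=0.04)
# Same setting with a larger scaling constant (illustrative only).
bound_400 = objective_error_bound(10000, 200, 500, beta=400, delta=0.04)
print(f"{bound_300:.4f} {bound_400:.4f}")
```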

The constraints on the weights in Problem P2 are of practical
significance because these weights can be regarded as probabilities
drawn from a distribution over the constituent rankers.
However, adding constraints increases the difficulty of solving
this optimization problem. Intuitively, the normalization of the
weights assigned to the ranking scores is nonessential because the
ranking position is determined by the relative values of the ranking
scores. Taking the toy example in Table I, the weights
(3.5,1.5) result in the same ranking as Ensemble 2 with (0.7,0.3). The
lemmas and theorems below prove the hypothesis that this
constrained nonlinear program can be approximated by an
unconstrained nonlinear program.
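The scale invariance claimed for the toy example is easy to verify. The following minimal sketch, assuming NumPy and a hypothetical helper `ranking`, applies both weight vectors to the ranking scores of Table I and compares the resulting orders.

```python
import numpy as np

# Ranking scores from Table I (Documents 1-3, Ranker 1 and Ranker 2).
scores = np.array([[0.35, 0.2],
                   [0.40, 0.1],
                   [0.25, 0.7]])


def ranking(weights):
    """1-based document ids ordered by descending weighted score."""
    combined = scores @ np.asarray(weights, dtype=float)
    return [int(i) + 1 for i in np.argsort(-combined, kind="stable")]


normalized = ranking([0.7, 0.3])     # weights summing to one
unnormalized = ranking([3.5, 1.5])   # the same weights scaled by 5
print(normalized, unnormalized)
```

Multiplying all weights by a positive constant rescales every ensemble score by the same factor, so the pairwise comparisons in Equation (1) are unchanged.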

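Lemma 1. Problem P2 is equivalent to the following problem:

\[
\max \quad \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{\hat{R}},
\tag{P3}
\]

where $\hat{R} = 1 + \sum_{d \in D,\, d \neq d_j} \hat{g}_{ij}$,

\[
\hat{g}_{ij} = \frac{\exp\left(-\beta \sum_{k=1}^{K_\phi} \alpha_k s_{d_j,d}(\phi_k(q_i))\right)}{1 + \exp\left(-\beta \sum_{k=1}^{K_\phi} \alpha_k s_{d_j,d}(\phi_k(q_i))\right)}
\quad \text{and} \quad
\alpha_k = \frac{\alpha'_k}{\sum_{k=1}^{K_\phi} \alpha'_k}, \quad \alpha'_k \ge 0, \quad k = 1, 2, \ldots, K_\phi.
\]

Since $\sum_{k=1}^{K_\phi} \alpha_k = 1$, it can be straightforwardly proved that
Problem P3 is equivalent to Problem P2.

Remark 1. If we let

\[
g'_{ij} = \frac{\exp\left(-\beta \sum_{k=1}^{K_\phi} \alpha'_k s_{d_j,d}(\phi_k(q_i))\right)}{1 + \exp\left(-\beta \sum_{k=1}^{K_\phi} \alpha'_k s_{d_j,d}(\phi_k(q_i))\right)},
\]

Theorem 1 applies for both $\hat{g}_{ij}$ and $g'_{ij}$ as well.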


The following theorem states that Problem P3 can be 
surrogated by an easier problem. 


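Theorem 3. Consider the following problem

\[
\max \quad \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R'},
\tag{P4}
\]

where $R' = 1 + \sum_{d \in D,\, d \neq d_j} g'_{ij}$. Let $\hat{\Lambda}$ and $\Lambda'$ denote the
objective function in Problem P3 and Problem P4, respectively.
Then, we have the following bound for the absolute difference
between $\hat{\Lambda}$ and $\Lambda'$:

\[
\left| \hat{\Lambda} - \Lambda' \right| \le \frac{\epsilon \left(L + \sum_{i=1}^{L} |D_i|\right)}{2L},
\tag{10}
\]

where $\epsilon = \epsilon' + \hat{\epsilon}$, $\epsilon' = |R' - R|$ and $\hat{\epsilon} = |\hat{R} - R|$.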

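Proof. From Lemma 1, we can derive the
following bound.

\[
\left| \hat{\Lambda} - \Lambda' \right| = \frac{1}{L} \left| \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j (R' - \hat{R})}{R' \hat{R}} \right|.
\]

Since $R' = 1 + \sum_{d \neq d_j} g'_{ij}$ and $\hat{R} = 1 + \sum_{d \neq d_j} \hat{g}_{ij}$ are strictly
positive, we have

\[
\left| \frac{j (R' - \hat{R})}{R' \hat{R}} \right|
= \frac{j \left| \sum_{d \neq d_j} (g'_{ij} - \hat{g}_{ij}) \right|}{\left(1 + \sum_{d \neq d_j} g'_{ij}\right)\left(1 + \sum_{d \neq d_j} \hat{g}_{ij}\right)}
= \frac{j \left| \sum_{d \neq d_j} \left[ (\hat{g}_{ij} - I\{ s_{d_j,d} < 0 \}) + (I\{ s_{d_j,d} < 0 \} - g'_{ij}) \right] \right|}{\left(1 + \sum_{d \neq d_j} g'_{ij}\right)\left(1 + \sum_{d \neq d_j} \hat{g}_{ij}\right)}.
\]

According to the general triangle inequality, we can draw an
upper bound for the term in the numerator:

\[
\begin{aligned}
\left| \sum_{d \neq d_j} \left[ (\hat{g}_{ij} - I\{ s_{d_j,d} < 0 \}) + (I\{ s_{d_j,d} < 0 \} - g'_{ij}) \right] \right|
&\le \sum_{d \neq d_j} \left| \hat{g}_{ij} - I\{ s_{d_j,d} < 0 \} \right| + \sum_{d \neq d_j} \left| I\{ s_{d_j,d} < 0 \} - g'_{ij} \right| \\
&\le \epsilon.
\end{aligned}
\]

Then, it is trivial to get

\[
\left| \hat{\Lambda} - \Lambda' \right| \le \frac{\epsilon \left(L + \sum_{i=1}^{L} |D_i|\right)}{2L}.
\tag{11}
\]

This completes the proof. □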


Since the differences $\epsilon'$ and $\hat{\epsilon}$ are small enough, Problem P4
can accurately approximate Problem P3. This theorem tells us
that the AOF is also determined by the ranking positions, i.e.,
the relative values of the ranking scores; thus the normalization
constraints in Problem P2 can be removed. Taking Lemma
1 and Theorem 2 into account, we can trivially draw the
following corollary.


Corollary 1. Problem P1 can be approximated by Problem
P4.


In the next section, we focus on proposing algorithms that
solve Problem P4.


IV. Algorithm 

In order to solve Problem P4, we propose algorithms
according to the data settings: batch setting and online setting.
In the batch setting, all the queries and ranking scores given
by constituent rankers are processed as a batch. Based on
the batch data, the weights over constituent rankers are
computed by maximizing the MAP. Two algorithms, gEnM.BAT
and gEnM.IP, are reported in this setting. The
batch algorithms merit consideration for those systems
containing complete data. Take an academic search engine as
an example. The titles can be seen as queries while the
abstracts and contents of publications can be regarded as
relevant documents. So a batch can be established to train
the proposed model.

In many IR environments such as recommendation systems 
in E-commerce, however, the queries and ranking scores are 
generated in real time so as to construct data sequences at 
different times. Thus, we will secondly propose an online 
algorithm, gEnM.ON, for dealing with these data sequences. 
The online algorithm is more scalable to large data sets 
with limited storage than the batch algorithm. In the online 
algorithm, the queries as well as corresponding ranking scores 
are input in a data stream and processed in a serial fashion. 

A common assumption for the aforementioned frameworks
is that the relevant documents are known. However, the
relevant documents are unknown in many modern IR
systems such as search engines. For this IR environment, we
further propose an unsupervised ensemble model, UnsEnM,
which makes use of a co-training framework.

A. Batch Algorithm: gEnM.BAT

Although many sophisticated methods can be applied for
finding a local optimum, we first propose a revised Newton’s
method. The major modification is the approximation of the
gradient and Hessian matrix.

For notational simplicity, we utilize:

\[
G_{ij} = \sum_{d \in D,\, d \neq d_j} g'_{ij},
\tag{12}
\]

\[
G^{k}_{ij} = \sum_{d \in D,\, d \neq d_j} \frac{\partial g'_{ij}}{\partial \alpha'_k},
\tag{13}
\]

\[
G^{l}_{ij} = \sum_{d \in D,\, d \neq d_j} \frac{\partial g'_{ij}}{\partial \alpha'_l},
\tag{14}
\]

\[
G^{kl}_{ij} = \sum_{d \in D,\, d \neq d_j} \frac{\partial^2 g'_{ij}}{\partial \alpha'_k \partial \alpha'_l}.
\tag{15}
\]

Under those notations, the first and second derivatives of the
objective function in Problem P4 can be written as

\[
\frac{\partial \Lambda'}{\partial \alpha'_k} = \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{-j\, G^{k}_{ij}}{(1 + G_{ij})^2}
\tag{16}
\]

and

\[
\frac{\partial^2 \Lambda'}{\partial \alpha'_k \partial \alpha'_l} = \frac{1}{L} \sum_{i=1}^{L} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j \left[ 2\, G^{k}_{ij} G^{l}_{ij} - (1 + G_{ij})\, G^{kl}_{ij} \right]}{(1 + G_{ij})^3},
\tag{17}
\]

respectively. According to the second derivative, the Hessian
matrix is defined by

\[
\mathcal{H}(\alpha') =
\begin{pmatrix}
\dfrac{\partial^2 \Lambda'}{\partial \alpha'_1 \partial \alpha'_1} & \cdots & \dfrac{\partial^2 \Lambda'}{\partial \alpha'_1 \partial \alpha'_{K_\phi}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 \Lambda'}{\partial \alpha'_{K_\phi} \partial \alpha'_1} & \cdots & \dfrac{\partial^2 \Lambda'}{\partial \alpha'_{K_\phi} \partial \alpha'_{K_\phi}}
\end{pmatrix}.
\tag{18}
\]

As stated by Theorem 6 in Appendix B, the addends in
the first derivative can be estimated by zeros under certain
conditions. This approximation also applies to the second
derivative as well as the Hessian matrix since both contain the
first derivative term. The advantages of using this approximation
are two-fold: a) the computation of the Hessian is simplified
since many addends are set to zero under certain conditions;
and b) the computations of $G_{ij}$, $G^{k}_{ij}$, $G^{l}_{ij}$ and $G^{kl}_{ij}$ can be
carried out offline before evaluating the derivative and Hessian,
which makes the learning algorithm inexpensive.

Since the objective function in Problem P4 is nonconvex,
multiple local optima may exist in the variable space.
Therefore, different starting points are chosen to prevent the
algorithm from getting stuck in one local optimum. The largest
local optimum and the corresponding weights are returned
as the final solution. To accelerate the algorithm, we can
distribute different starting points onto different cores for
parallel computing.

The batch algorithm is summarized as follows. We note that
$\alpha_p$ and $s_{d_j,d}(\phi(q_i))$ represent the vectors with elements $\alpha_k$ and
$s_{d_j,d}(\phi_k(q_i))$, respectively, and that $p = 1, 2, \ldots, P$ indexes $P$
initial values.

Algorithm 1 gEnM.BAT (Generalized Ensemble Model by
Revised Newton’s Algorithm in Batch Setting)

Require: Query set $Q$, document set $D$, relevant document
    set $D_i$ with respect to $q_i \in Q$, ranking scores $s_d(\phi_k(q_i))$
    with respect to the $i$-th query, $k$-th method $\phi_k$ and document
    $d \in D$, a number of initial points $\alpha_p$ and a threshold
    $\epsilon > 0$ for stopping the algorithm.
 1: for each $\alpha_p$ do
 2:     Set iteration counter $t = 1$;
 3:     Evaluate $\Lambda'^{\,t}$;
 4:     repeat
 5:         Set $t = t + 1$;
 6:         Compute gradient $\nabla_{\alpha_p^{t-1}} \Lambda'$ and Hessian matrix $\mathcal{H}(\alpha_p^{t-1})$ (Algorithm 2);
 7:         Update $\alpha_p^{t}$ from $\alpha_p^{t-1}$ by a Newton step using $\nabla_{\alpha_p^{t-1}} \Lambda'$ and $\mathcal{H}(\alpha_p^{t-1})$;
 8:         Evaluate $\Lambda'^{\,t}$;
 9:     until $|\Lambda'^{\,t} - \Lambda'^{\,t-1}| < \epsilon$
10:     Store $\alpha_p$
11: end for
12: return $\alpha$’s.

A drawback of the conventional Newton’s method is
that it is designed for unconstrained nonlinear programs while
our problem requires nonnegative weights. Thus applying the above
algorithm may result in negative weights. The strategy for
avoiding this shortcoming is to set the final negative weights
to zero. As a matter of fact, the rankers with negative
weights play a negative role in the ensemble model. Thus,
ignoring those rankers is reasonable in practice.
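The derivative in Equation (16) can be illustrated for a single query. The sketch below is not the paper's gEnM.BAT implementation: it assumes NumPy, uses a hypothetical function name and random scores, computes the exact sigmoid derivatives rather than the zero approximation of Theorem 6, and takes the order of the relevant documents as given.

```python
import numpy as np


def objective_and_gradient(alpha, S, relevant, beta):
    """Smoothed average precision of one query (a summand of P4) and its gradient.

    alpha: (K,) weights; S: (N, K) scores of N documents from K rankers;
    relevant: indices of relevant documents, assumed ordered by rank.
    """
    alpha = np.asarray(alpha, dtype=float)
    value, grad = 0.0, np.zeros_like(alpha)
    for j, dj in enumerate(relevant, start=1):
        diff = np.delete(S[dj] - S, dj, axis=0)          # rows: s_{d_j,d}(phi_k), d != d_j
        g = 1.0 / (1.0 + np.exp(beta * (diff @ alpha)))  # g'_ij for every d
        G = g.sum()                                      # Equation (12)
        # d g / d alpha_k = -beta * s_k * g * (1 - g), summed over d: Equation (13)
        Gk = (-beta * (g * (1.0 - g))[:, None] * diff).sum(axis=0)
        value += j / (1.0 + G)
        grad += -j * Gk / (1.0 + G) ** 2                 # summand of Equation (16)
    n = len(relevant)
    return value / n, grad / n


rng = np.random.default_rng(0)
S = rng.random((6, 3))               # hypothetical scores: 6 documents, 3 rankers
alpha = np.array([0.5, 0.3, 0.2])
value, grad = objective_and_gradient(alpha, S, relevant=[0, 2], beta=5.0)
print(value, grad)
```

Averaging the per-query values and gradients over all queries gives the objective of Problem P4 and Equation (16); a finite-difference comparison is a convenient sanity check before using the gradient inside a Newton or gradient step.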

B. Online Algorithm: gEnM.ON

In the previous two subsections, we have presented the
learning algorithms for generating gEnM from batch data sets.
In contrast to the batch setting, the online setting provides the
gEnM with a long sequence of data. The weights are calculated
sequentially based on the data stream that consists of a
series of time steps $t = 1, 2, \ldots, T$. For example, the gEnM
is constructed based on the new queries and corresponding
rankings given at different times in a search engine. The final
goal is also to maximize the overall MAP on the data sets:

\[
\max \quad \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|D_t|} \sum_{j=1}^{|D_t|} \frac{j}{1 + \sum_{d \in D,\, d \neq d_j} g'_{tj}}.
\tag{19}
\]




As a matter of fact, the presented batch algorithms can be 
applied directly in the online setting by regarding the whole 
observed sequences as a batch at each step. In doing so, 
however, the overall complexity is extremely high since the 
batch algorithm should be run once at each time step. 

Algorithm 2 Approximated Derivative and Hessian Computation
Algorithm

Require: Query set $Q$, document set $D$, relevant document
    set $D_i$ with respect to $q_i \in Q$, ranking scores $s_d(\phi_k(q_i))$
    with respect to the $i$-th query, $k$-th method $\phi_k$ and document
    $d \in D$, current weights $\alpha_p^{t-1}$.
 1: for $q_i \in Q$ do
 2:     for $d_j \in D_i$ do
 3:         Set $G_{ij}$, $G^{k}_{ij}$, $G^{l}_{ij}$ and $G^{kl}_{ij}$ to zeros;
 4:         for $d \in D$ do
 5:             $s_{d_j,d}(\phi_k(q_i)) \leftarrow s_{d_j}(\phi_k(q_i)) - s_d(\phi_k(q_i))$;
 6:             $g'_{ij} \leftarrow \dfrac{\exp(-\beta \alpha_p^{t-1} s_{d_j,d}(\phi(q_i)))}{1 + \exp(-\beta \alpha_p^{t-1} s_{d_j,d}(\phi(q_i)))}$;
 7:             $G_{ij} \leftarrow G_{ij} + g'_{ij}(\alpha_p^{t-1})$;
 8:             if $-\xi < \alpha_p^{t-1} s_{d_j,d}(\phi(q_i)) < \xi$ then
 9:                 $G^{kl}_{ij} \leftarrow G^{kl}_{ij} + \beta^2 s_{d_j,d}(\phi_k(q_i))\, s_{d_j,d}(\phi_l(q_i))\, g'_{ij}(\alpha_p^{t-1})\, (1 \ldots$
10:                 $G^{k}_{ij} \leftarrow G^{k}_{ij} + \beta s_{d_j,d}(\phi_k(q_i))$;
11:                 $G^{l}_{ij} \leftarrow G^{l}_{ij} + \beta s_{d_j,d}(\phi_l(q_i))$;
12:             else
13:                 $G^{kl}_{ij} \leftarrow G^{kl}_{ij}$;
14:                 $G^{k}_{ij} \leftarrow G^{k}_{ij}$;
15:                 $G^{l}_{ij} \leftarrow G^{l}_{ij}$;
16:             end if
17:         end for
18:     end for
19: end for
20: Compute gradient $\nabla_{\alpha_p^{t-1}} \Lambda'$ (Equation (16))
    and Hessian matrix $\mathcal{H}(\alpha_p^{t-1})$ (Equation (18));
21: return $\nabla_{\alpha_p^{t-1}} \Lambda'$ and $\mathcal{H}(\alpha_p^{t-1})$.

In the online setting, the subsequent queries are not available
at present. An alternative optimization technique should be
considered to avoid focusing too much on the present
training data. To distinguish from the notation in the batch
setting, we let $x$ be the query and suppose $x_1, x_2, \ldots, x_t, \ldots$
are the queries given at times $t$ in the online setting. Here, we
assume that these sequences are drawn from the ground-truth
distribution $p(x)$. Thus, the objective function of MAP can be
defined as the expectation of average precision, i.e.,

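\[
J(\alpha) = \sum_{t=1}^{\infty} p(x_t)\, f(x_t, \alpha) = E_p[f(x, \alpha)],
\tag{20}
\]

where

\[
f(x_t, \alpha) = \frac{1}{|D_{x_t}|} \sum_{j=1}^{|D_{x_t}|} \frac{j}{1 + \sum_{d \in D,\, d \neq d_j} g'_{x_t j}(\alpha)}.
\]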


The expectation cannot be maximized directly because the
true distribution $p(x)$ is unknown. However, we can estimate
the expectation by the empirical MAP that simply uses finite
training observations. A plausible approach for solving
this empirical MAP optimization problem is to use the
stochastic gradient descent (SGD) algorithm, which is a drastic
simplification of the expensive gradient descent algorithm.
Though the SGD algorithm is a less accurate optimization
algorithm compared to the batch algorithm, it is faster in terms
of computational time and cheaper in terms of memory
[?]. Another advantage is that the SGD algorithm is more
adaptive to a changing environment in which examples are
given sequentially [?].

For our problem, the SGD learning rule is formulated as

\[
\alpha_{t+1} = \alpha_t + \eta_t \nabla f(x_{t+1}, \alpha_t),
\tag{21}
\]

where $\eta_t$ is called the learning rate, i.e., a positive value depending
on $t$. This updating rule increases the objective
value at each step in terms of expectation, which can be
verified by the following theorem.

Theorem 4. Using the updating rule (21), the expectation of
average precision increases at each step, i.e.,

\[
E_p[f(x, \alpha_{t+1})] \ge E_p[f(x, \alpha_t)].
\]


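Proof. Since $E_p[f(x, \alpha_{t+1})] - E_p[f(x, \alpha_t)] = E_p[f(x, \alpha_{t+1}) - f(x, \alpha_t)]$,
we only need to show $f(x, \alpha_{t+1}) - f(x, \alpha_t) \ge 0$.

Since

\[
f(x, \alpha_{t+1}) - f(x, \alpha_t) = \frac{1}{|D_x|} \sum_{j=1}^{|D_x|} \frac{j \sum_{d \neq d_j} \left( g'_{xj}(\alpha_t) - g'_{xj}(\alpha_{t+1}) \right)}{\left(1 + \sum_{d \neq d_j} g'_{xj}(\alpha_{t+1})\right)\left(1 + \sum_{d \neq d_j} g'_{xj}(\alpha_t)\right)},
\]

we need to verify $g'_{xj}(\alpha_{t+1}) - g'_{xj}(\alpha_t) \le 0$. According to the
definition of $g'_{xj}$, we have

\[
g'_{xj}(\alpha_t) - g'_{xj}(\alpha_{t+1}) = \frac{T(\alpha_t) - T(\alpha_{t+1})}{(1 + T(\alpha_t))(1 + T(\alpha_{t+1}))},
\]

where $T(\alpha) = \exp(-\beta \alpha\, s(\phi))$ and $s(\phi)$ abbreviates $s_{d_j,d}(\phi(x))$.
Since

\[
\frac{T(\alpha_t)}{T(\alpha_{t+1})} = \exp\left(\beta \eta_t \nabla f(x, \alpha_t)\, s(\phi)\right) > \exp(0) = 1,
\tag{22}
\]

we can conclude that

\[
T(\alpha_t) - T(\alpha_{t+1}) > 0.
\]

This completes the proof. □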

The learning rate $\eta_t$ plays an important role in the updating
(Equation (22)); hence an adequate $\eta_t$ helps the online
algorithm converge. Define $\eta_t = 1/t$ in this article; then we
have the following well-known properties:

\[
\sum_{t=1}^{\infty} \eta_t^2 < \infty,
\tag{23}
\]

\[
\sum_{t=1}^{\infty} \eta_t = \infty.
\tag{24}
\]
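The two properties of the learning rate $\eta_t = 1/t$ can be illustrated with partial sums. The sketch below uses only the standard library; the helper name `partial_sums` and the truncation at $10^5$ terms are choices made for this illustration. The sum of squares stays below its known limit $\pi^2/6$, while the plain sum keeps growing like $\ln n$.

```python
import math


def partial_sums(n):
    """Partial sums of eta_t and eta_t^2 for eta_t = 1/t, t = 1..n."""
    etas = [1.0 / t for t in range(1, n + 1)]
    return sum(etas), sum(e * e for e in etas)


sum_eta, sum_eta_sq = partial_sums(10**5)
print(sum_eta, sum_eta_sq, math.pi ** 2 / 6)
```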


Since it is difficult to analyze the whole process of the online
algorithm [?], we will show the convergence property around
the global or local optimum in the following analysis.

Lemma 2. If $\alpha_t$ is in the neighborhood of the optimum $\alpha^*$,
we have

\[
(\alpha_t - \alpha^*) \nabla f(x, \alpha_t) < 0.
\tag{25}
\]

The proof is straightforward by referring to Equation [?].
This lemma states that the gradient drives the current point
towards the maximum $\alpha^*$. In the stochastic process, the
following inequality holds

\[
(\alpha_t - \alpha^*) E_p[\nabla f(x, \alpha_t)] < 0.
\tag{26}
\]

Lemma 3. If $\alpha_t$ is in the neighborhood of the optimum $\alpha^*$,
we have

\[
\lim_{t \to \infty} \nabla f(x, \alpha_t)^2 < \infty.
\tag{27}
\]

The proof is given in the Appendix. Owing to the stochastic
nature, the expectation of $\nabla f(x, \alpha_t)^2$ also converges almost
surely, i.e.,

\[
\lim_{t \to \infty} E_p[\nabla f(x, \alpha_t)^2] < \infty.
\tag{28}
\]

Algorithm 3 gEnM.ON (Generalized Ensemble Model by Online Algorithm)

Require: Query set $Q$, document set $D$, relevant document set $|D_i|$ with respect to $q_i \in Q$, ranking scores $s_d(\phi_k(q_i))$ with respect to the $i$th query, $k$th method $\phi_k$ and document $d \in D$, a number of initial points $\alpha_p$ and a threshold $\epsilon > 0$ for stopping the algorithm.

1: for each $\alpha_p$ do
2:   Set iteration counter $t = 1$;
3:   Evaluate $A'^{t}$;
4:   repeat
5:     for each $q_i \in Q$ do
6:       Set $t = t + 1$;
7:       Compute gradient $\nabla_{\alpha^{t-1}} A'$ with respect to $q_i$ (Algorithm 2);
8:       Update $\alpha_p^{t} = \alpha_p^{t-1} + \eta_t \nabla_{\alpha^{t-1}} A'$;
9:     end for
10:    Evaluate $A'^{t}$;
11:  until $|A'^{t} - A'^{t-1}| < \epsilon$
12:  Store $\alpha_p^{t}$
13: end for
14: return $\alpha$'s.

Theorem 5. In the neighborhood of the maximum $\alpha^*$, the recursive variables $\alpha_t$ converge to the maximum, i.e.,

$$\lim_{t \to \infty} \alpha_t = \alpha^*. \tag{29}$$

Proof. Define a sequence of positive numbers whose values measure the distance from the optimum, i.e.,

$$h_t = (\alpha_t - \alpha^*)^2. \tag{30}$$

Substituting the update $\alpha_{t+1} = \alpha_t + \eta_t \nabla f(x, \alpha_t)$ and expanding the square, the variation of the sequence can be written as an expectation under the stochastic nature, i.e.,

$$E_p[h_{t+1} - h_t] = 2\eta_t (\alpha_t - \alpha^*)\,E_p[\nabla f(x, \alpha_t)] + \eta_t^2\,E_p[\nabla f(x, \alpha_t)^2]. \tag{31}$$

Since the first term on the right hand side is negative according to (26), we can obtain the following bound:

$$E_p[h_{t+1} - h_t] < \eta_t^2\,E_p[\nabla f(x, \alpha_t)^2]. \tag{32}$$

Conditions (23) and (28) imply that the sum of the right hand side converges. According to the quasi-martingale convergence theorem, we can also verify that $h_t$ converges almost surely. This result implies the convergence of the sum of the first term in (31).

Since the sum of $\eta_t$ diverges according to (24), we can get

$$\lim_{t \to \infty} (\alpha_t - \alpha^*)\,E_p[\nabla f(x, \alpha_t)] = 0. \tag{33}$$

This result leads to the convergence of the online algorithm, i.e.,

$$\lim_{t \to \infty} \alpha_t = \alpha^*.$$

This completes the proof. □

Based on the learning rule, the online algorithm for achieving the ensemble model is summarized in Algorithm 3.
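The loop structure of the online algorithm (Algorithm 3) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `online_ascent`, `grad_fn` and `objective_fn` are hypothetical names, the gradient and the objective are supplied by the caller, and `max_epochs` is an added safeguard that does not appear in the algorithm. A toy concave objective stands in for the smoothed MAP.

```python
def online_ascent(alpha, queries, grad_fn, objective_fn, eps=1e-4, max_epochs=100):
    """Online gradient ascent with step size eta_t = 1/t.

    grad_fn(alpha, query) -> list of partial derivatives for one query.
    objective_fn(alpha, queries) -> scalar objective over all queries.
    Both callables are placeholders supplied by the caller.
    """
    alpha = list(alpha)
    t = 1
    previous = objective_fn(alpha, queries)
    for _ in range(max_epochs):          # safeguard, not part of Algorithm 3
        for query in queries:            # queries are processed one at a time
            t += 1
            eta = 1.0 / t
            grad = grad_fn(alpha, query)
            alpha = [a + eta * g for a, g in zip(alpha, grad)]
        current = objective_fn(alpha, queries)
        if abs(current - previous) < eps:
            break
        previous = current
    return alpha


# Toy usage: each "query" is a target value c and the objective is -(alpha - c)^2.
def toy_grad(alpha, c):
    return [-2.0 * (alpha[0] - c)]


def toy_objective(alpha, targets):
    return -sum((alpha[0] - c) ** 2 for c in targets) / len(targets)


learned = online_ascent([0.0], [3.0, 3.0], toy_grad, toy_objective)
```

In an actual run, `grad_fn` would compute the gradient of the smoothed objective for a single query and `objective_fn` would evaluate it over the training queries.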


C. Unsupervised Algorithm: UnsEnM

The preceding algorithms for both the batch setting and the online setting rely on labeled data and are therefore regarded as supervised learning. In conventional information retrieval systems, however, labeled data are generally difficult to obtain. Under this condition, unsupervised learning plays a crucial role. The inspiration for the unsupervised algorithm for solving Problem P4 comes from the idea of co-training, which is based on the belief that each constituent ranker in the ensemble model can provide valuable information to the other constituent rankers such that they can co-learn from each other. In order to utilize this collaborative learning scheme, the gEnM requires that all constituent rankers be generated by unsupervised learning. In each round, the ranking scores of one of the constituent rankers are provided as fake labeled data for the other rankers to refine the weights. By iteratively learning from the constituent rankers, the ensemble model may achieve an overall improvement in terms of MAP.

We modify the objective function in Problem P4 by adding a penalty term so that the refined ranking does not depend on the fake labels too much. The modified objective function is defined as

max A' - iff ^ ^ ^ \\Hd{qi) - Sd(fk[q^))\\^ 

d^D 

(P8) 

where $H_d(q_i) = \sum_k \alpha_k s_d(\phi_k(q_i))$.


Let $\Gamma$ denote the objective function in Problem P8. The second derivatives of $\Gamma$ can be written as follows:

^ ^ ^ X! X! i^difkiqi)) ■ Sdifiiqi))) 

qi&Q deD 

( 34 ) 


dttkai dauai 
The approximation of the Hessian matrix reported in Algorithm 2 can be employed here; however, doing so is time-consuming, since the unsupervised algorithm requires a large number of iterations to converge and the Hessian would have to be calculated at each iteration. Therefore, the learning rule of the online algorithm gEnM.ON is applied for the unsupervised algorithm. It is noteworthy that the gEnM.ON can be readily modified to fit this unsupervised co-training scheme. The algorithm is described below.

Algorithm 4 UnsEnM (Unsupervised Ensemble Model)

Require: Query set $Q$, document set $D$, ranking scores $s_d(\phi_k(q_i))$ with respect to the $i$th query, $k$th method $\phi_k$ and document $d \in D$, a number of initial points $\alpha_p$, a threshold $\epsilon_s$ for $s_d(\phi_k(q_i))$ to choose fake relevant documents and a threshold $\epsilon > 0$ for stopping the algorithm.

1: for each $\alpha_p$ do
2:   Set iteration counter $t = 1$;
3:   Evaluate $A'^{t}$;
4:   repeat
5:     for each $\phi_k \in \Phi$ do
6:       Set $t = t + 1$;
7:       Refresh fake relevant document set $|D_i| = \emptyset$;
8:       Construct $s_d$ that excludes $s_d(\phi_k)$;
9:       Construct $\alpha_p$ that excludes $\alpha_k$;
10:      for $q_i \in Q$ do
11:        if $s_d(\phi_k(q_i)) > \epsilon_s$ then
12:          Construct fake relevant document set $|D_i| \leftarrow d \cup |D_i|$;
13:        end if
14:      end for
15:      Compute gradient $\nabla_{\alpha^{t-1}} A'$ (Algorithm 2);
16:      Update $\alpha_p^{t} = \alpha_p^{t-1} + \eta_t \nabla_{\alpha^{t-1}} A'$;
17:    end for
18:    Reconstruct $\alpha_p$ that includes $\alpha_k$;
19:    Evaluate $A'^{t}$;
20:  until $|A'^{t} - A'^{t-1}| < \epsilon$
21:  Store $\alpha_p^{t}$
22: end for
23: return $\alpha$'s.
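The construction of the fake relevant document sets (steps 7 to 14 of Algorithm 4) amounts to thresholding the scores of the ranker that currently acts as the label source. The sketch below illustrates this step only; the function name, the dictionary layout and the example scores are assumptions made for illustration.

```python
def fake_relevant_sets(scores, threshold):
    """scores[q][d] is the score one ranker assigns to document d for query q.

    Returns, per query, the set of documents whose score exceeds the threshold.
    """
    return {
        query: {doc for doc, s in doc_scores.items() if s > threshold}
        for query, doc_scores in scores.items()
    }


# Hypothetical scores from the ranker currently acting as the label source.
label_ranker_scores = {
    "q1": {"d1": 0.91, "d2": 0.15, "d3": 0.62},
    "q2": {"d1": 0.05, "d2": 0.77, "d3": 0.40},
}
fake_labels = fake_relevant_sets(label_ranker_scores, threshold=0.6)
```

The remaining rankers would then be updated against `fake_labels` as if these sets were true relevance judgments.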


V. Empirical Experiment 

A. Experiment Setup 

The proposed methods were evaluated on four standard medium-sized ad-hoc document collections, i.e., MED, CRAN, CISI and CACM, which can be accessed freely from the SMART IR System¹. In order to test the proposed methods on heterogeneous data, we utilized the merged collection (MC) advocated by [12], which combines the four collections. The basic statistics of the test data are summarized in Table II. The following minimum pre-processing measures were taken for the collections before evaluating the proposed methods: a) stop words were removed from the corpus by referring to a list of 571 stop words provided by SMART¹; b) special symbols, including hyphenation marks, were removed; and c) words that appear only once in the corpus were removed. We note that the incomplete documents and queries in CISI and CACM were retained in the experiments.

¹Available at: ftp://ftp.cs.cornell.edu/pub/smart.


TABLE II: Data characteristics.

| Data | Subject | Document # | Query # | Term # |
|---|---|---|---|---|
| MED | Medicine | 1,033 | 30 | 5,775 |
| CRAN | Aeronautics | 1,400 | 225 | 8,213 |
| CISI | Library | 1,460 | 112 | 10,170 |
| CACM | Computer | 3,204 | 64 | 9,961 |
| MC | Multiplicity | 7,097 | 431 | 27,784 |


The constituent rankers are, in essence, important factors that influence the results. Four rankers recommended by [12], namely the tf-idf-based ranker (TFIDF), Latent Semantic Analysis (LSA), probabilistic Latent Semantic Indexing (pLSI) [20], and Indexing by Latent Dirichlet Allocation (LDI), were utilized in this paper for assembling the gEnM. In brief, TFIDF represents documents by a tf-idf weighted matrix; LSA projects each document into a lower dimensional conceptual space by applying Singular Value Decomposition (SVD); pLSI is a probabilistic version of LSA; and LDI represents each document by a probabilistic distribution over shared topics based on Latent Dirichlet Allocation (LDA). These rankers are all unsupervised and are thus straightforward to train in the unsupervised setting. In addition to satisfying this training requirement, the rankers capture different information describing each corpus, such as keyword matching, concepts, or topics.

Since the four rankers represent documents and queries as vectors, the ranking scores are the cosine similarities between the vectors of documents and queries. Subsequently, the ranking scores of gEnM can be generated by appropriately weighting the ranking scores of the four rankers. For formulating Problem P4, we set $\beta = 200$. Finally, the proposed algorithms can be implemented to calculate the optimal weights for gEnM.
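The scoring pipeline described above (a cosine similarity per ranker, then a weighted sum across rankers) can be illustrated with a small sketch. The vectors and weights below are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ensemble_score(alpha, ranker_scores):
    """Weighted sum of the constituent rankers' scores for one query-document pair."""
    return sum(a * s for a, s in zip(alpha, ranker_scores))


# Hypothetical query/document vectors under two different representations.
query_vecs = [[1.0, 0.0, 1.0], [0.5, 0.5]]
doc_vecs = [[1.0, 1.0, 0.0], [0.5, 0.5]]
scores = [cosine_similarity(q, d) for q, d in zip(query_vecs, doc_vecs)]
combined = ensemble_score([1.0, 0.5], scores)
```

Each ranker may use a vector space of a different dimension; only the resulting scalar scores are combined.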

In order to address the over-fitting problem of batch algorithms, we adopted two-fold cross validation for testing gEnM.BAT and gEnM.ON. The difference for gEnM.ON is that the training queries and the corresponding relevant documents were given sequentially at each step. The performance metric was the mean value of the MAPs over the two folds. As for the UnsEnM, the ranking scores of different constituent rankers were provided as labeled data for the other rankers in different rounds. The UnsEnM was then evaluated by means of MAP on the real labeled data.
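For reference, the evaluation metric can be computed as in the sketch below. It uses the standard (non-smoothed) definition of average precision; the function names and the toy rankings are assumptions made for illustration.

```python
def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list, given the set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of the per-query average precisions."""
    values = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(values) / len(values)


rankings = {"q1": ["d1", "d2", "d3"], "q2": ["d2", "d1", "d3"]}
relevance = {"q1": {"d1", "d3"}, "q2": {"d1"}}
map_value = mean_average_precision(rankings, relevance)
```

In a two-fold setting, this value would be computed on each held-out fold and the two results averaged.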

As discussed in Section IV, the proposed algorithms would benefit from different initial weights. Choosing proper initial points for a nonlinear program is an open research issue. In our tests, we used the operational criterion of selecting the best; in other words, we tested the performance for different initial weights and selected the one that generated the maximum retrieval performance in terms of MAP. In this experiment, we first set the initial weights to binary elements, i.e., each element of $\alpha$ is either 0 or 1. The reason for doing so is that, with binary weights, some constituent rankers are initially active and the others inactive, which reflects our heuristics at the first step. Since the EnM has been shown to be superior to the four basis rankers, the EnM was used as the baseline method for comparison.
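The "select the best" criterion over binary initial weights can be sketched as follows. Excluding the all-zero vector is an assumption made here (it would switch every ranker off); `train_fn` and `evaluate_fn` are hypothetical placeholders for the learning algorithm and the MAP evaluation.

```python
from itertools import product


def binary_initial_points(num_rankers):
    """All 0/1 weight vectors of the given length, excluding the all-zero vector."""
    return [list(bits) for bits in product((0, 1), repeat=num_rankers) if any(bits)]


def select_best(initial_points, train_fn, evaluate_fn):
    """Train from every initial point and keep the weights with the highest score.

    train_fn(alpha0) -> learned weights; evaluate_fn(weights) -> e.g. MAP.
    Both callables are placeholders for the learner and the metric.
    """
    best_weights, best_score = None, float("-inf")
    for alpha0 in initial_points:
        weights = train_fn(alpha0)
        score = evaluate_fn(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score


points = binary_initial_points(4)
```

With four constituent rankers this yields 15 candidate starting points, including $(1,1,1,1)$ and $(1,0,0,0)$.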


B. Experimental Results 

The experimental results are shown in Table III. We have considered three measures for comparing the performances of the proposed algorithms: mean average precision (MAP), (average) precision at one document (Pr@1), and (average) precision at five documents (Pr@5). The gEnM performs better than the EnM in nearly all cases; the exceptions are the UnsEnM on CACM and the Pr@5 of gEnM.ON on CISI. Since the EnM is also solved by a batch algorithm, we conduct the Wilcoxon signed rank test to evaluate the difference between EnM and gEnM.BAT. We see that, in most cases (11 of the 15 collection-measure pairs), the difference is statistically significant at the 95% confidence level. We emphasize that the Pr@1 of gEnM.BAT is 48% higher than that of EnM for the CISI data set and reaches 0.9333 for MED. In other words, the documents retrieved by gEnM are more relevant at high ranking positions, which is desirable from the user's point of view.
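The impr(%) column of Table III appears to be the relative gain of gEnM.BAT over EnM; this is an inference from the reported numbers rather than something stated explicitly. The sketch below reproduces three of the entries under that assumption, using values copied from the table.

```python
def relative_improvement(baseline, value):
    """Relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (EnM, gEnM.BAT) pairs copied from Table III.
table_iii = {
    ("MED", "MAP"): (0.6420, 0.6458),
    ("CISI", "Pr@1"): (0.3289, 0.4868),
    ("MC", "Pr@5"): (0.307, 0.3614),
}
improvements = {
    key: round(relative_improvement(enm, bat), 1)
    for key, (enm, bat) in table_iii.items()
}
```

The 48% figure quoted for Pr@1 on CISI corresponds to the move from 0.3289 to 0.4868.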

From Table III we also see that the gEnM.ON performs slightly better than the gEnM.BAT in terms of MAP on MED, CRAN and CACM. This slight advantage of gEnM.ON is due to the approximation of the Hessian in gEnM.BAT. However, the gEnM.ON is more expensive than gEnM.BAT because of the iterative use of queries for calculation. Having said that, gEnM.ON can be used in a specific system where data are given in sequence. Since the relevant documents are unknown in unsupervised learning, the performance of UnsEnM is generally inferior to that of the supervised algorithms. However, its results on the more heterogeneous data set MC are, surprisingly, the best among the proposed algorithms in terms of MAP and Pr@5. The supervised algorithms may work well when tested against similar queries and documents in the homogeneous data. Yet the unsupervised algorithm does not fit the training data as closely as the supervised algorithms do, and thus its advantage becomes more obvious when tested on more heterogeneous data.

Figure 3 shows the precision-recall curves of the examined methods.

To illustrate the learning abilities of the gEnM.ON and UnsEnM, the learning curves on the MED data are reported in Figure 4. The results on the other data sets are very similar. The tolerance is set to $10^{-4}$ and the number of iterations is set to at least 10 in order to clearly view the changes of the objective. The online learning curves validate the convergence property of gEnM.ON. Amongst these curves, several scenarios, such as when $\alpha = (1,1,1,1)^T$ and $\alpha = (1,0,0,0)^T$, imply that the gEnM.ON may occasionally fail for some queries that are not similar to the previous sequences and not near the local optimum. With the increase of iterations, however, the impact of those queries may be mitigated due to the majority effect. Apart from these specific cases, the gEnM.ON is able to gradually learn from the sequences, which is consistent with the theoretical analysis.

The UnsEnM also converges with the increase of iterations. We can see that, in the case of $\alpha = (1,0,0,0)^T$, a ranker whose scores are regarded as supervised labels may dramatically decrease the objective function. In most cases, the impact of such rankers can be balanced out by the other rankers. As a matter of fact, this phenomenon is similar to that of gEnM.ON, since the data are given sequentially in both cases.

Fig. 3: Precision-Recall Curves for the testing data sets (panels include MED, CRAN and MC).

Fig. 4: Learning curves of gEnM.ON and UnsEnM with different initial points on MED.

TABLE III: Comparison of the algorithms for gEnM and baseline methods. Pr@1 denotes the precision at one document and Pr@5 the precision at five documents. An asterisk (*) indicates a statistically significant difference between EnM and gEnM.BAT with a 95% confidence according to the Wilcoxon signed rank test.

| Collection | Measure | EnM | gEnM.BAT | gEnM.ON | UnsEnM | impr(%) |
|---|---|---|---|---|---|---|
| MED | MAP | 0.6420 | 0.6458 | 0.6467 | 0.6465 | +0.6 |
| MED | Pr@1 | 0.8667 | 0.9333 | 0.9333 | 0.9333 | +7.7* |
| MED | Pr@5 | 0.7867 | 0.8133 | 0.8133 | 0.8133 | +3.4* |
| CRAN | MAP | 0.3766 | 0.3937 | 0.3972 | 0.3972 | +4.5 |
| CRAN | Pr@1 | 0.6133 | 0.6622 | 0.6667 | 0.6356 | +8.0* |
| CRAN | Pr@5 | 0.3742 | 0.4080 | 0.3991 | 0.4018 | +9.0* |
| CISI | MAP | 0.1637 | 0.1945 | 0.1816 | 0.1825 | +18.8* |
| CISI | Pr@1 | 0.3289 | 0.4868 | 0.3684 | 0.3947 | +48.0* |
| CISI | Pr@5 | 0.2974 | 0.3237 | 0.2868 | 0.3079 | +8.8 |
| CACM | MAP | 0.1890 | 0.2166 | 0.2256 | 0.1745 | +14.6* |
| CACM | Pr@1 | 0.3654 | 0.3846 | 0.4423 | 0.3077 | +5.3 |
| CACM | Pr@5 | 0.2192 | 0.2500 | 0.2538 | 0.2000 | +14.1* |
| MC | MAP | 0.2768 | 0.3162 | 0.3099 | 0.3169 | +14.2* |
| MC | Pr@1 | 0.4204 | 0.5196 | 0.5300 | 0.5274 | +23.6* |
| MC | Pr@5 | 0.307 | 0.3614 | 0.3624 | 0.3629 | +17.7* |

VI. Conclusions and Discussions

In this paper, we propose a generalized ensemble model, gEnM, which tries to find the optimal linear combination of multiple constituent rankers by directly optimizing the problem defined based on the mean average precision. In order to solve this optimization problem, the algorithms are devised in two aspects, i.e., supervised and unsupervised. In addition, two settings for the data are considered in the supervised learning, namely the batch and online settings. Table IV summarises the algorithms with potential applications in practice. In brief, the gEnM.BAT can be used in those IR systems that have the knowledge of labeled data, such as academic search engines; the gEnM.ON is appropriate for real-time systems where the data are given in sequence, such as movie recommendation systems; and the UnsEnM is proposed for those systems without the knowledge of labeled data, such as search engines.

An experimental study was conducted based on the public data sets. The encouraging results verify the effectiveness of the proposed algorithms for both homogeneous and heterogeneous data. In terms of MAP, the gEnM performs better than the EnM in all cases except for the UnsEnM on CACM. Briefly, the difference between gEnM.BAT and EnM is statistically significant in most cases; the gEnM.ON performs the best among the proposed algorithms for MED, CRAN and CACM; and the unsupervised UnsEnM is more applicable to heterogeneous data than the supervised algorithms.

While we have shown the effectiveness of the proposed algorithms, we have not yet analyzed their computational complexity. Though we simplified the computation of the derivative and Hessian matrix, we were unable to reduce the complexity of the batch algorithm based on Newton's method. A possible future direction is to exploit cheaper and faster algorithms for the batch setting. Another interesting research topic is the selection of initial weights, which is an open research issue in nonlinear programming.

Apart from the potential improvements with regard to algorithms, the selection of constituent rankers is an extremely important issue. This problem may be resolved if we can identify which ranker is redundant for the ensemble. In this paper, we use human heuristics for choosing the four rankers. However, a concrete framework to effectively evaluate the contribution of each ranker is no doubt a subject worthy of further study.

TABLE IV: Summary of the algorithms: gEnM.BAT, gEnM.ON and UnsEnM.

| Algorithm | Category | Setting | Application |
|---|---|---|---|
| gEnM.BAT | supervised | batch | academic search engine, etc. |
| gEnM.ON | supervised | online | movie recommendation, etc. |
| UnsEnM | unsupervised | batch | search engine, etc. |
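Appendix A

Derivation of the derivative of A'

(1) Derivation of the first derivative

According to the chain rule of calculus, the derivative of the objective in Problem P4 with respect to $\alpha_k$, $k = 1, 2, \ldots$, is

$$\frac{\partial A'}{\partial \alpha_k} = \frac{1}{|Q|} \sum_{i} \frac{1}{|D_i|} \sum_{j} \frac{-j \sum_{d \neq d_j} \frac{\partial g'_{ij}}{\partial \alpha_k}}{\left(1 + \sum_{d \neq d_j} g'_{ij}\right)^2}, \tag{35}$$

where

$$\frac{\partial g'_{ij}}{\partial \alpha_k} = -\beta\, s_{d_j,d}(\phi_k(q_i))\, g'_{ij}\,(1 - g'_{ij}). \tag{36}$$

(2) Derivation of the second derivative

Also by the chain rule, the second derivative with respect to $\alpha_l$, $l = 1, 2, \ldots$, is

$$\frac{\partial^2 A'}{\partial \alpha_k \partial \alpha_l} = \frac{1}{|Q|} \sum_{i} \frac{1}{|D_i|} \sum_{j} j \left[ -\frac{\sum_{d \neq d_j} \frac{\partial^2 g'_{ij}}{\partial \alpha_k \partial \alpha_l}}{\left(1 + \sum_{d \neq d_j} g'_{ij}\right)^2} + \frac{2 \left(\sum_{d \neq d_j} \frac{\partial g'_{ij}}{\partial \alpha_k}\right)\left(\sum_{d \neq d_j} \frac{\partial g'_{ij}}{\partial \alpha_l}\right)}{\left(1 + \sum_{d \neq d_j} g'_{ij}\right)^3} \right], \tag{37}$$

where

$$\frac{\partial^2 g'_{ij}}{\partial \alpha_k \partial \alpha_l} = -\beta\, s_{d_j,d}(\phi_k(q_i))\,(1 - 2 g'_{ij})\, \frac{\partial g'_{ij}}{\partial \alpha_l}, \tag{38}$$

and $\partial g'_{ij} / \partial \alpha_l$ can be calculated by Equation (36).

Appendix B

Approximation of the derivative of sigmoid function

For notational simplicity, we begin by considering the following sigmoid function:

$$f(x) = \frac{1}{1 + \exp(\beta x)}. \tag{39}$$

Theorem 6. The derivative of function (39) can be approximated as follows:

$$\frac{d f(x)}{d x} \approx \begin{cases} -\beta \left(f(x) - f(x)^2\right), & \text{if } -\frac{2}{\beta} \leq x \leq \frac{2}{\beta}, \\ 0, & \text{if } x < -\frac{2}{\beta} \text{ or } x > \frac{2}{\beta}. \end{cases} \tag{40}$$

We note that this approximation is more precise with a larger $\beta$.

Proof. We apply the centered linear approximation method to the approximation of the sigmoid function as shown in Figure 5:

$$f(x) \approx \begin{cases} f(x), & \text{if } -\frac{2}{\beta} \leq x \leq \frac{2}{\beta}, \\ 1, & \text{if } x < -\frac{2}{\beta}, \\ 0, & \text{if } x > \frac{2}{\beta}. \end{cases} \tag{41}$$

Hence $f(x)(1 - f(x)) = 0$ if $x < -\frac{2}{\beta}$ or $x > \frac{2}{\beta}$. This completes the proof. □

Fig. 5: The approximation of sigmoid function through the centered linear approximation method ($\beta = 300$).

Remark 2. The derivative function (36) can be approximated by

$$\frac{\partial g'_{ij}}{\partial \alpha_k} \approx \begin{cases} -\beta\, s_{d_j,d}(\phi_k(q_i))\, g'_{ij}\,(1 - g'_{ij}), & \text{if } -\frac{2}{\beta} \leq \sum_k \alpha_k s_{d_j,d}(\phi_k(q_i)) \leq \frac{2}{\beta}, \\ 0, & \text{otherwise}, \end{cases} \tag{42}$$

if the scaling constant $\beta$ is large.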
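The truncation described in Theorem 6 can be illustrated numerically. The sketch below assumes the sigmoid of Equation (39) and a window of half-width $2/\beta$; the function names are hypothetical, and the value $\beta = 200$ is the one used in the experiments.

```python
import math


def sigmoid(x, beta):
    """f(x) = 1 / (1 + exp(beta * x)), as in Equation (39)."""
    z = beta * x
    if z > 700.0:  # avoid overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(z))


def exact_derivative(x, beta):
    """Exact derivative of the sigmoid: -beta * f * (1 - f)."""
    f = sigmoid(x, beta)
    return -beta * f * (1.0 - f)


def truncated_derivative(x, beta):
    """Derivative set to zero outside the window |x| <= 2 / beta."""
    if abs(x) > 2.0 / beta:
        return 0.0
    return exact_derivative(x, beta)


beta = 200.0
inside = truncated_derivative(0.0, beta)    # centre of the window
outside = truncated_derivative(0.05, beta)  # 0.05 > 2 / beta = 0.01
```

Outside the window the exact derivative is already small for large $\beta$, so skipping those terms saves computation at a modest cost in accuracy.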


Appendix C

Proof of Lemma 3

In this section, we only sketch the proof of Lemma 3.

Sketch of Proof. In this proof, we use simple symbols for clarity. For example, $g(\alpha_t)$ denotes $g'_{xj}(\alpha'_t)$.

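$$\nabla f(x, \alpha_{t+1})^2 - \nabla f(x, \alpha_t)^2 = \left( \frac{j \beta \sum s\, g(\alpha_{t+1})(1 - g(\alpha_{t+1}))}{\left(1 + \sum_{d \neq d_j} g(\alpha_{t+1})\right)^2} \right)^2 - \left( \frac{j \beta \sum s\, g(\alpha_t)(1 - g(\alpha_t))}{\left(1 + \sum_{d \neq d_j} g(\alpha_t)\right)^2} \right)^2 \leq \left( j \beta \sum s\, g(\alpha_{t+1})(1 - g(\alpha_{t+1})) \right)^2.$$

For $g(\alpha_{t+1}) - g(\alpha_{t+1})^2$, we have

$$g(\alpha_{t+1}) - g(\alpha_{t+1})^2 < \frac{1}{2 + \exp\left(\beta \sum (\alpha_t + \eta_t \nabla f)\, s\right)} < \frac{1}{2 + \exp\left(\beta \sum \eta_t \nabla f\, s\right)}.$$

Thus, we have

$$\nabla f(x, \alpha_{t+1})^2 - \nabla f(x, \alpha_t)^2 < \left( \frac{j \beta \sum s}{2 + \exp\left(\beta \sum \eta_t \nabla f\, s\right)} \right)^2.$$

It is easy to show that the right hand side is the summand of a convergent infinite sum. This result implies that $\nabla f(x, \alpha_t)^2$ converges because it is bounded and its oscillations are damped. □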
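The first bound on $g - g^2$ used in the sketch can be verified directly. Writing $g = 1/(1 + e^{z})$, where $z$ is shorthand (introduced here for illustration) for the argument of the exponential, a short worked derivation is:

```latex
g\,(1 - g)
  = \frac{1}{1 + e^{z}} \cdot \frac{e^{z}}{1 + e^{z}}
  = \frac{e^{z}}{1 + 2e^{z} + e^{2z}}
  = \frac{1}{2 + e^{z} + e^{-z}}
  < \frac{1}{2 + e^{z}} .
```

The final inequality holds because $e^{-z} > 0$ for every real $z$.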



